Multilingual Sentence Categorization according to Language

Author

  • Emmanuel Giguet
Abstract

Sentence categorization according to language is a fundamental issue for NLP, especially in document processing. With the growing amount of multilingual text corpus data becoming available, sentence categorization, which leads to multilingual text structure, opens a wide range of applications in multilingual text analysis, such as information retrieval or preprocessing for multilingual syntactic parsers. The major difficulties in sentence categorization are convergence and textual errors. Convergence is difficult because dealing with short entries involves discarding languages on the basis of few clues. Textual errors arise because documents coming from different electronic sources may contain spelling and grammatical errors as well as character recognition errors generated by OCR. We describe here an approach to sentence categorization whose originality is that it is based on natural properties of languages, with no dependency on a training set. The implementation is fast, small, robust, and tolerant of textual errors. Tested on discrimination between French, English, Spanish, and German, the system achieved 99.4% correct assignments on real sentences in one test. The resolution power is based on grammatical words (not the most common words) and the alphabet. Having the grammatical words and the alphabet of each language at its disposal, the system computes for each language its likelihood of being selected. The name of the language with the optimum likelihood tags the sentence, but unresolved ambiguities are maintained. We discuss the reasons that led us to use these linguistic facts and present several directions for improving the system's classification performance. Categorizing sentences with linguistic properties shows that difficult problems sometimes have simple solutions.

* This paper is published in the Proceedings of the European Chapter of the Association for Computational Linguistics SIGDAT Workshop "From Text to Tags: Issues in Multilingual Language Analysis", held in March 1995 in Dublin.

The emergence of text categorization according to language came with the need to process texts coming from all over the world. The goal of text categorization is to tag texts with the name of the language in which they are written. Information retrieval is the main application field. The traditional way to do this is to exploit the differences between letter combinations in different languages (Cavnar and Trenkle, 1994). For each language, the system computes from a training set a profile based on the frequency (or probability) of letter sequences. Then, for a given text, it computes a profile and selects the language which has the …
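
As a rough illustration of the idea stated in the abstract (grammatical words plus alphabet, with no training set), the following Python sketch scores a sentence against each language and keeps every language that ties for the best score. It is only a sketch under stated assumptions: the word lists and alphabets are small hand-picked samples rather than those of the paper, the names `GRAMMATICAL_WORDS`, `ALPHABETS`, `score_sentence`, and `categorize` are hypothetical, and the score (grammatical-word hits minus out-of-alphabet letters) is a simple stand-in for the likelihood computation, which the abstract does not specify.

```python
import re

# Illustrative samples only; NOT the word lists or alphabets used in the paper.
GRAMMATICAL_WORDS = {
    "french": {"le", "la", "les", "des", "est", "et", "un", "une", "dans", "que"},
    "english": {"the", "of", "and", "is", "a", "in", "that", "to", "with"},
    "spanish": {"el", "la", "los", "las", "es", "y", "un", "una", "en", "que"},
    "german": {"der", "die", "das", "und", "ist", "ein", "eine", "in", "mit"},
}

BASE = set("abcdefghijklmnopqrstuvwxyz")
ALPHABETS = {
    "french": BASE | set("àâçéèêëîïôùûüœ"),
    "english": BASE,
    "spanish": BASE | set("áéíóúüñ"),
    "german": BASE | set("äöüß"),
}


def score_sentence(sentence):
    """Return a score per language for one sentence.

    Assumed scoring: +1 for each token that is a grammatical word of the
    language, -1 for each distinct letter outside the language's alphabet.
    A penalty (rather than outright elimination) is an assumption made here
    so that a single OCR or spelling error does not rule a language out.
    """
    # Tokens are runs of Unicode letters (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    letters = set("".join(tokens))
    scores = {}
    for lang, words in GRAMMATICAL_WORDS.items():
        word_hits = sum(1 for token in tokens if token in words)
        foreign_letters = len(letters - ALPHABETS[lang])
        scores[lang] = word_hits - foreign_letters
    return scores


def categorize(sentence):
    """Return the set of best-scoring languages; ties are kept, not broken."""
    scores = score_sentence(sentence)
    best = max(scores.values())
    return {lang for lang, score in scores.items() if score == best}


if __name__ == "__main__":
    print(categorize("The cat is in the garden with a ball"))
    print(categorize("Der Hund ist groß und schläft"))
    print(categorize("la casa"))  # too few clues: more than one language remains
```

Returning a set rather than a single label mirrors the abstract's remark that unresolved ambiguities are maintained: with these sample lists, a very short entry such as "la casa" gives too few clues to separate French from Spanish.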

Similar resources

Multilingual Sentence Categorization According to Language

Sentence categorization according to language is a fundamental issue for NLP, especially in document processing. With the growing amount of multilingual text corpus data becoming available, sentence categorization, which leads to multilingual text structure, opens a wide range of applications in multilingual text analysis such as information retrieval or preprocessing of multilingual sy...

Dynamic Categorization of Semantics of Fashion Language: A Memetic Approach

Categories are not invariant. This paper attempts to explore the dynamic nature of semantic category, in particular, that of fashion language, based on the cognitive theory of Dawkins’ memetics, a new theory of cultural evolution. Semantic attributes of linguistic memes decrease or proliferate in replication and spreading, which involves a dynamic development of semantic category. More specific...

MUSE – A Multilingual Sentence Extractor

MUltilingual Sentence Extractor (MUSE) is aimed at multilingual single-document summarization. MUSE implements the supervised language-independent summarization approach based on optimization of multiple statistical sentence ranking methods. The MUSE tool consists of two main modules: the training module activated in the offline mode, and the on-line summarization module. The training module ca...

Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization

Cross-language Text Categorization is the task of assigning semantic classes to documents written in a target language (e.g. English) while the system is trained using labeled documents in a source language (e.g. Italian). In this work we present many solutions according to the availability of bilingual resources, and we show that it is possible to deal with the problem even when no such resour...

Document Categorization using Multilingual Associative Networks based on Wikipedia

Associative networks are a connectionist language model with the ability to categorize large sets of documents. In this research we combine monolingual associative networks based on Wikipedia to create a larger, multilingual associative network, using the cross-lingual connections between Wikipedia articles. We prove that such multilingual associative networks perform better than monolingual as...

Journal:
  • CoRR

Volume: cmp-lg/9502039

Publication date: 1995